OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Open source optical character recognition for historical research

Internal identifier: 000303 (Main/Exploration); previous: 000302; next: 000304

Authors: Tobias Blanke [United Kingdom]; Michael Bryant [United Kingdom]; Mark Hedges [United Kingdom]

Source: Journal of documentation (ISSN 0022-0418), 2012

RBID : Pascal:13-0104039

Abstract

Purpose - This paper aims to present an evaluation of open source OCR for supporting research on material in small- to medium-scale historical archives.

Design/methodology/approach - The approach was to develop a workflow engine to support the easy customisation of the OCR process towards the historical materials using open source technologies. Commercial OCR often fails to deliver sufficient results here, as its processing is optimised towards large-scale commercially relevant collections. The approach presented here allows users to combine the most effective parts of different OCR tools.

Findings - The authors demonstrate their application and its flexibility and present two case studies, which demonstrate how OCR can be embedded into wider digitally enabled historical research. The first case study produces high-quality research-oriented digitisation outputs, utilising services that the authors developed to allow for the direct linkage of digitisation image and OCR output. The second case study demonstrates what becomes possible if OCR can be customised directly within a larger research infrastructure for history. In such a scenario, further semantics can be added easily to the workflow, enhancing the research browse experience significantly.

Originality/value - There has been little work on the use of open source OCR technologies for historical research. This paper demonstrates that the authors' workflow approach allows users to combine commercial engines' ability to read a wider range of character sets with the flexibility of open source tools in terms of customisable pre-processing and layout analysis. All this can be done without the need to develop dedicated code.



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Open source optical character recognition for historical research</title>
<author>
<name sortKey="Blanke, Tobias" sort="Blanke, Tobias" uniqKey="Blanke T" first="Tobias" last="Blanke">Tobias Blanke</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>Centre for e-Research, King's College London</s1>
<s2>London</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<placeName>
<settlement type="city">Londres</settlement>
<region type="country">Angleterre</region>
<region type="région" nuts="1">Grand Londres</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Bryant, Michael" sort="Bryant, Michael" uniqKey="Bryant M" first="Michael" last="Bryant">Michael Bryant</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>Centre for e-Research, King's College London</s1>
<s2>London</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<placeName>
<settlement type="city">Londres</settlement>
<region type="country">Angleterre</region>
<region type="région" nuts="1">Grand Londres</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Hedges, Mark" sort="Hedges, Mark" uniqKey="Hedges M" first="Mark" last="Hedges">Mark Hedges</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>Centre for e-Research, King's College London</s1>
<s2>London</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<placeName>
<settlement type="city">Londres</settlement>
<region type="country">Angleterre</region>
<region type="région" nuts="1">Grand Londres</region>
</placeName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">13-0104039</idno>
<date when="2012">2012</date>
<idno type="stanalyst">PASCAL 13-0104039 INIST</idno>
<idno type="RBID">Pascal:13-0104039</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000059</idno>
<idno type="stanalyst">FRANCIS 13-0104039 INIST</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000074</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000709</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000065</idno>
<idno type="wicri:doubleKey">0022-0418:2012:Blanke T:open:source:optical</idno>
<idno type="wicri:Area/Main/Merge">000306</idno>
<idno type="wicri:Area/Main/Curation">000303</idno>
<idno type="wicri:Area/Main/Exploration">000303</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Open source optical character recognition for historical research</title>
<author>
<name sortKey="Blanke, Tobias" sort="Blanke, Tobias" uniqKey="Blanke T" first="Tobias" last="Blanke">Tobias Blanke</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>Centre for e-Research, King's College London</s1>
<s2>London</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<placeName>
<settlement type="city">Londres</settlement>
<region type="country">Angleterre</region>
<region type="région" nuts="1">Grand Londres</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Bryant, Michael" sort="Bryant, Michael" uniqKey="Bryant M" first="Michael" last="Bryant">Michael Bryant</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>Centre for e-Research, King's College London</s1>
<s2>London</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<placeName>
<settlement type="city">Londres</settlement>
<region type="country">Angleterre</region>
<region type="région" nuts="1">Grand Londres</region>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Hedges, Mark" sort="Hedges, Mark" uniqKey="Hedges M" first="Mark" last="Hedges">Mark Hedges</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>Centre for e-Research, King's College London</s1>
<s2>London</s2>
<s3>GBR</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<placeName>
<settlement type="city">Londres</settlement>
<region type="country">Angleterre</region>
<region type="région" nuts="1">Grand Londres</region>
</placeName>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">Journal of documentation</title>
<title level="j" type="abbreviated">J. doc.</title>
<idno type="ISSN">0022-0418</idno>
<imprint>
<date when="2012">2012</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">Journal of documentation</title>
<title level="j" type="abbreviated">J. doc.</title>
<idno type="ISSN">0022-0418</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Archive</term>
<term>Archives</term>
<term>Collection</term>
<term>Electronic library</term>
<term>Open source</term>
<term>Optical character recognition</term>
<term>Workflow</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Reconnaissance optique caractère</term>
<term>Workflow</term>
<term>Collection</term>
<term>Archives</term>
<term>Archive</term>
<term>Bibliothèque électronique</term>
<term>Open source</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Archives</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Purpose - This paper aims to present an evaluation of open source OCR for supporting research on material in small- to medium-scale historical archives. Design/methodology/approach - The approach was to develop a workflow engine to support the easy customisation of the OCR process towards the historical materials using open source technologies. Commercial OCR often fails to deliver sufficient results here, as its processing is optimised towards large-scale commercially relevant collections. The approach presented here allows users to combine the most effective parts of different OCR tools. Findings - The authors demonstrate their application and its flexibility and present two case studies, which demonstrate how OCR can be embedded into wider digitally enabled historical research. The first case study produces high-quality research-oriented digitisation outputs, utilising services that the authors developed to allow for the direct linkage of digitisation image and OCR output. The second case study demonstrates what becomes possible if OCR can be customised directly within a larger research infrastructure for history. In such a scenario, further semantics can be added easily to the workflow, enhancing the research browse experience significantly. Originality/value - There has been little work on the use of open source OCR technologies for historical research. This paper demonstrates that the authors' workflow approach allows users to combine commercial engines' ability to read a wider range of character sets with the flexibility of open source tools in terms of customisable pre-processing and layout analysis. All this can be done without the need to develop dedicated code.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Royaume-Uni</li>
</country>
<region>
<li>Angleterre</li>
<li>Grand Londres</li>
</region>
<settlement>
<li>Londres</li>
</settlement>
</list>
<tree>
<country name="Royaume-Uni">
<region name="Angleterre">
<name sortKey="Blanke, Tobias" sort="Blanke, Tobias" uniqKey="Blanke T" first="Tobias" last="Blanke">Tobias Blanke</name>
</region>
<name sortKey="Bryant, Michael" sort="Bryant, Michael" uniqKey="Bryant M" first="Michael" last="Bryant">Michael Bryant</name>
<name sortKey="Hedges, Mark" sort="Hedges, Mark" uniqKey="Hedges M" first="Mark" last="Hedges">Mark Hedges</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000303 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000303 | SxmlIndent | more
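
Outside Dilib, a saved copy of the record can also be inspected with a general-purpose XML library. The record as shown on this page uses the wicri: and inist: prefixes without declaring them, so a namespace-aware parser will likely reject it unchanged. The following is a minimal sketch, not part of Dilib, assuming Python's standard xml.etree.ElementTree module; the namespace URIs are placeholders invented for illustration (the real ones are not given on this page), and SAMPLE is an abridged copy of the record.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumptions, not the real Wicri/INIST ones):
# the record uses these prefixes without declaring them.
PLACEHOLDER_NS = {
    "wicri": "urn:example:wicri",
    "inist": "urn:example:inist",
}


def parse_record(xml_text):
    """Parse a <record> string after binding the undeclared prefixes."""
    decls = " ".join(f'xmlns:{p}="{u}"' for p, u in PLACEHOLDER_NS.items())
    # Inject the declarations into the first <record> start tag only.
    patched = xml_text.replace("<record>", f"<record {decls}>", 1)
    return ET.fromstring(patched)


def summarize(record):
    """Return (title, [author names]) taken from the titleStmt element."""
    stmt = record.find("./TEI/teiHeader/fileDesc/titleStmt")
    title = stmt.findtext("title")
    authors = [name.text for name in stmt.findall("./author/name")]
    return title, authors


# Abridged copy of the record shown above (one author only).
SAMPLE = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Open source optical character recognition for historical research</title>
<author>
<name sortKey="Blanke, Tobias">Tobias Blanke</name>
<affiliation wicri:level="3">
<inist:fA14 i1="01">
<s1>Centre for e-Research, King's College London</s1>
</inist:fA14>
</affiliation>
</author>
</titleStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""

if __name__ == "__main__":
    title, authors = summarize(parse_record(SAMPLE))
    print(title)
    print(authors)
```

The same functions could be applied to the text produced by the HfdSelect commands above once it has been saved to a file, provided that output has the same structure as the record shown on this page.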

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:13-0104039
   |texte=   Open source optical character recognition for historical research
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024